1. INTRODUCTION

Bayesian variable selection is widely applied, with a rich literature on alternative priors
and computational methods. For a recent review of Bayesian variable selection methods,
refer to O'Hara and Sillanpää (2009). Most of the literature has focused on Gaussian linear
regression models, with common methods including stochastic search variable selection
(SSVS) (George and McCulloch, 1993; 1997), reversible jump MCMC (Green, 1995) and
adaptive shrinkage (Tibshirani, 1996; Park and Casella, 2008; Yi and Xu, 2008). Such methods
can be applied directly for kernel or basis function selection in nonlinear regression with
Gaussian residuals (Smith and Kohn, 1996) and can be adapted to accommodate generalized
linear models with outcomes in the exponential family (Raftery and Richardson, 1993; Meyer
and Laud, 2002).

It is well known that Bayesian variable selection can be sensitive to the prior, and there is
an increasingly rich literature showing asymptotic properties that support carefully chosen
priors, such as mixtures of g-priors (Zellner and Siow, 1980; Liang et al., 2008), with
such priors also having appealing computational properties. This literature is almost
entirely focused on Gaussian linear regression models, and the emphasis of this article is on
developing methods that generalize this work to semiparametric regression models having
unknown residual distributions.

To set the stage, first consider the well-studied problem of comparison of linear models 
of the following type: 



$$M_1: Y^n = \alpha 1_n + X_{\gamma_1}\beta_{\gamma_1} + \epsilon_1, \quad \epsilon_1 \sim N(0, \tau^{-1} I_n),$$
$$M_2: Y^n = \alpha 1_n + X_{\gamma_2}\beta_{\gamma_2} + \epsilon_2, \quad \epsilon_2 \sim N(0, \tau^{-1} I_n), \qquad (1)$$

where $Y^n$ is an $n \times 1$ vector of responses, $\alpha$ is the common intercept, $X_{\gamma_j}$ is an
$n \times p_j$ design matrix ($j = 1, 2$) excluding the column of intercepts, and the $\epsilon_j$'s are
Gaussian residuals, $j = 1, 2$. The models may or may not be nested, and the number of
candidate predictors is $p$. Among numerous model selection criteria available for such
comparisons, the Bayes factor (Kass and Raftery, 1995) has received substantial attention
as the most widely accepted Bayesian measure of the weight of evidence in the data in favor
of one model over another. The Bayes factor for comparing $M_1$ versus $M_2$ based on a sample
$Y^n$ is defined as $BF^n_{12} = L(Y^n \mid M_1)/L(Y^n \mid M_2)$, the ratio of marginal likelihoods under
$M_1$ and $M_2$. Assuming one of the models under comparison is true, Bayes factor consistency
refers to the phenomenon where $BF^n_{12} \to \infty$ in probability as $n \to \infty$ under $M_1$ and
$BF^n_{12} \to 0$ in probability as $n \to \infty$ under $M_2$. A stronger form of consistency is
also possible when the convergence happens almost surely. When comparing the true model
pairwise to each model in a list, Bayes factor consistency typically implies that the posterior
probability on the true model goes to one.
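The link between pairwise Bayes factors and posterior model probabilities can be illustrated with a short sketch. The function names and the three log marginal likelihood values below are hypothetical and chosen only for illustration; they are not taken from the analyses in this article.

```python
import math

def posterior_model_probs(log_marginals, prior_probs):
    """Posterior model probabilities from log marginal likelihoods.

    P(M_j | Y) is proportional to prior_probs[j] * exp(log_marginals[j]).
    """
    # Work on the log scale and subtract the maximum for numerical stability.
    log_post = [lm + math.log(pr) for lm, pr in zip(log_marginals, prior_probs)]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

def bayes_factor(log_marginal_1, log_marginal_2):
    """BF_12 = L(Y | M_1) / L(Y | M_2), computed from log marginal likelihoods."""
    return math.exp(log_marginal_1 - log_marginal_2)

# Hypothetical log marginal likelihoods for three candidate models.
log_m = [-120.0, -125.0, -131.0]
probs = posterior_model_probs(log_m, [1 / 3, 1 / 3, 1 / 3])
```

When every Bayes factor in favor of the first model diverges, the corresponding weights of the competing models vanish and the first posterior probability tends to one, which is the sense in which pairwise consistency carries over to a list of models.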

Although priors most commonly used in practice assume a priori independence in the
elements of the coefficient vectors ($\beta_1$ and $\beta_2$), priors that have been shown to result in Bayes
factor consistency typically incorporate dependence. Examples include the intrinsic prior
(Berger and Pericchi, 1996; Moreno, Bertolino and Racugno, 1998), and Zellner's g-prior
(Zellner, 1986) specified by $\beta_j \sim N(0, g\tau^{-1}(X_j'X_j)^{-1})$, $j = 1, 2$. The intrinsic priors have proven
to behave very well for multiple testing problems (Casella and Moreno, 2006). Zellner's
g-prior allows for a convenient correlation structure and can control the amount of prior
information relative to the sample through only one hyperparameter $g$. Among others,
Fernandez et al. (2001) investigated Bayes factor consistency under various choices of fixed
$g$, which was allowed to depend on the sample size and/or the number of candidate predictors.



In order to resolve difficulties associated with a fixed choice of $g$, such as Bartlett's paradox
(Bartlett, 1957; Jeffreys, 1961) and the information paradox (Zellner, 1986; Berger and Pericchi,
2001), Zellner and Siow (1980) placed an inverse-gamma prior on $g$, while Liang et al. (2008)
extended the idea of Strawderman (1971) to the regression context by proposing hyper-g and
hyper-g/n priors on $g$, under which they established Bayes factor consistency. The above
approaches entail specifying improper priors on common model parameters and proper priors
on model parameters unique to any one model, which results in a prior specification for the
more complex model depending upon the simpler model. To avoid such pitfalls, Guo and
Speckman (2009) adopted the idea of Marin and Robert (2007) and placed mixtures of
g-priors on all the elements of both $\beta_1$ and $\beta_2$, which leads to tractable Bayes factors as
well as Bayes factor consistency.

There has also been a growing interest in model selection procedures for normal linear
models when the number of candidate predictors ($p$) increases with the sample size ($n$). Such
increases occur in a wide variety of applications, such as in nonparametric regression when
the number of candidate kernels or basis functions depends on $n$. Shao (1997) analyzed the
consistency of several frequentist and Bayesian approximation criteria for model selection
in normal linear models with increasing model dimensions, assuming the true model to be
the submodel minimizing the average squared prediction error. Moreno et al. (2010)
examined consistency of Bayes factors and the BIC under intrinsic priors for nested normal
linear models, when the dimension of the parameter space increases with the sample size.
Jiang (2007) considered Bayesian variable selection in generalized linear models in $p > n$
settings and provided conditions to obtain near optimal rates of convergence in estimating the
conditional predictive distribution, but did not consider asymptotic properties in selecting
the important predictors.

To our knowledge, this area has entirely focused on parametric models with a particular 
focus on normal linear regression. Such a parametric assumption on the residual error is 
rather stringent and may not hold in practice, thus invalidating the earlier assumption of 
the true model belonging to the class of models under comparison and potentially leading 
to inconsistent Bayes factors. In Section 5, simulations illustrate that when residuals are 
generated from a bimodal distribution, Bayesian variable selection under a Gaussian linear 
regression model tends to have poor performance. With this motivation, our focus is on 
developing Bayes variable selection methods that do not require Gaussian residuals and that 
can be shown to be consistent. 

There is a limited literature on variable selection in Bayesian regression models having 
unknown residual distributions. Kuo and Mallick (1997) consider an accelerated failure time 
model for time-to-event data containing a linear regression component and a mixture of 
Dirichlet processes for the residual density. To perform variable selection, they add indicator 
variables to the regression function and implement an MCMC algorithm. Also, in the survival 
analysis setting, Dunson and Herring (2005) proposed a Bayesian approach for selecting 
predictors in a semiparametric hazards model that allows uncertainty in whether predictors 
enter in a multiplicative or additive manner. Kim, Tadesse and Vannucci (2006) instead 
define a Bayesian variable selection approach, which uses a Dirichlet process to define clusters 
in the data, while updating the variable inclusion indicators using a Metropolis scheme. 
Mostofi and Behboodian (2007) modeled a symmetric and unimodal residual density using a
Dirichlet process scale mixture of uniforms, while conducting Bayesian variable selection.
Chung and Dunson (2009) modeled the conditional response density given predictors using a
flexible probit stick-breaking mixture of Gaussian linear models, allowing variable selection
via a Bayesian stochastic search method.

These articles focused on defining methodology and computational algorithms, but with- 
out study of theoretical properties, such as consistency. In fact, to our knowledge, there has 
been no previous work on consistent Bayesian variable selection in semi-parametric models, 
though there is recent work on consistent non-parametric Bayesian model selection (Ghosal, 
Lember and van der Vaart, 2008 among others). It is not straightforward to apply such 
theory directly to the problem of variable selection in semiparametric linear models. 

With this motivation, we define a practical and general methodology for Bayesian
variable selection in semiparametric linear models, while providing basic theoretical support
by showing Bayes factor and variable selection consistency. We accomplish this by
generalizing the methods and asymptotic theory for mixtures of g-priors to linear regression
models with unknown residual densities characterized via a Dirichlet process (DP) location
mixture of Gaussians. We propose a new class of mixtures of semi-parametric g-priors, which
results in consistent Bayesian variable selection even when there are many more candidate
predictors ($p$) than samples ($n$), as long as the prior assigns probability zero to models having
$n$ or more predictors. Additionally, posterior computation for the proposed method
is straightforward via an SSVS algorithm.

Section 2 develops the proposed framework. Section 3 considers asymptotic properties. 
Section 4 outlines algorithms for posterior computation. Section 5 contains a simulation 
study. Section 6 applies the approach to a type 2 diabetes data example, and Section 7
discusses the results.

2. MIXTURES OF SEMIPARAMETRIC g-PRIORS
2.1 MODEL FORMULATION 

In this section, we propose a new class of priors for Bayesian variable selection in linear 
regression models with an unknown residual density characterized via a Dirichlet process 
(DP) location mixture of Gaussians. In particular, let 

$$y_i = x_{\gamma,i}'\beta_\gamma + \epsilon_i, \quad \epsilon_i \sim f, \quad i = 1, \ldots, n,$$
$$f(\cdot) = \int N(\cdot\,; \alpha, \tau^{-1})\,dP(\alpha), \quad P \sim DP(mP_0), \quad P_0 = N(0, \tau^{-1}), \qquad (2)$$

where $\gamma = (\gamma_1, \ldots, \gamma_p)' \in \Gamma$ is a vector of variable inclusion indicators, with $\gamma_j = I$(jth
predictor is included in the model) and $\sum_{j=1}^p \gamma_j = p_\gamma$, $\beta_\gamma = \{\beta_j : \gamma_j = 1, j = 1, \ldots, p\}$,
$x_{\gamma,i} = \{x_{ij} : \gamma_j = 1, j = 1, \ldots, p\} \in \mathcal{X}$ does not include an intercept, and $f$ is a density
with respect to Lebesgue measure on $\Re$. For simplicity, we model $f$ as having an unknown
mean instead of including an intercept $\alpha$ as in (1). The number of candidate predictors $p$
may or may not increase with the sample size $n$. We can address the prior uncertainty in
subset selection by placing a prior on $\gamma$, while the prior on $\beta_\gamma$ characterizes prior knowledge
of the size of the coefficients for the selected predictors.

The DP mixture prior on the density $f$ induces clustering of the $n$ subjects into $k$ groups,
with each group having a distinct intercept in the linear regression model. Let $A$ denote an
$n \times k$ allocation matrix, with $A_{ij} = 1$ if the $i$th subject is allocated to the $j$th cluster and
0 otherwise. The $j$th column of $A$ then sums to $n_j$, the number of subjects allocated to
cluster $j$, with $\sum_{j=1}^k n_j = n$. Following Kyung, Gill and Casella (2009), conditionally on the
allocation matrix $A$, (2) can be represented as the linear model

$$Y^n = A\eta + X_\gamma\beta_\gamma + \epsilon, \quad \eta \sim N(0, \tau^{-1} I_k), \quad \epsilon \sim N(0, \tau^{-1} I_n), \qquad (3)$$

where $X_\gamma = (x_{\gamma,i}, i = 1, \ldots, n)'$.

In keeping with the mixtures of g-priors literature, we would like the prior on the regression
coefficients to retain the essential elements of Zellner's g-prior, but at the same time to
be suitably adapted to reflect the semi-parametric nature of the model in question, more
specifically the clustering of responses by the DP kernel mixture prior. To this effect, we
propose a mixture of semi-parametric g-priors which is constructed to scale the covariance
matrix in Zellner's g-prior to reflect the clustering phenomenon as follows:

$$\pi(\beta_\gamma) = N(0, g\tau^{-1}(X_\gamma'\Sigma_A^{-1}X_\gamma)^{-1}), \quad \Sigma_A = I + AA', \quad g \sim \pi(g). \qquad (4)$$
Prior (4) inherits the advantages of the traditional mixtures of g-priors, including
computational efficiency in computing marginal likelihoods (conditional on $A$) and robustness to
mis-specification of $g$. In addition, the prior can be interpreted as having arisen from the
analysis of a conceptual sample generated using a scaled design matrix $\Sigma_A^{-1/2}X_\gamma$, reflecting
the clustering phenomenon due to the DP kernel mixture prior. Moreover, the proposed prior
leads to Bayes factor and variable selection consistency in semi-parametric linear models (2),
as highlighted in the sequel.

Note that since $(X_\gamma'\Sigma_A^{-1}X_\gamma)^{-1} \ge (X_\gamma'X_\gamma)^{-1}$ for any allocation matrix $A$, the prior variance
of $\beta_\gamma$ conditional on $(A, g, \tau^{-1})$ is higher for the semi-parametric g-prior as compared to the
traditional g-prior. To assess the influence of $A$ on the prior for $\beta_\gamma$, we ran simulations
which revealed that for fixed $(n, p)$, var($\beta_k$) increases but cov($\beta_k, \beta_l$) decreases as the
number of underlying clusters in the data increases ($k, l = 1, \ldots, p$, $k \ne l$). This suggests that
as the number of clusters increases, the components of $\beta_\gamma$ are likely to be more dispersed with
decreasing association between each other.
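To make the role of the allocation matrix concrete, the sketch below computes the matrix $X_\gamma'\Sigma_A^{-1}X_\gamma$ appearing in prior (4) for a given clustering, assuming $\Sigma_A = I + AA'$. It relies on the Woodbury identity $\Sigma_A^{-1} = I - A(I_k + A'A)^{-1}A'$, where $A'A$ is diagonal with the cluster sizes. The function name `scaled_gram`, the label encoding of $A$ and the small design matrix are illustrative choices, not part of the original method description.

```python
def scaled_gram(X, labels):
    """Return X' Sigma_A^{-1} X for Sigma_A = I + A A'.

    X is a list of n rows (each a list of p floats); labels[i] is the
    cluster of subject i, i.e. the column of A holding the 1 in row i.
    Uses the Woodbury identity Sigma_A^{-1} = I - A (I_k + A'A)^{-1} A',
    where A'A is diagonal with the cluster sizes n_j.
    """
    p = len(X[0])
    # Start from the ordinary Gram matrix X'X.
    gram = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    # Column sums of X within each cluster (the rows of A'X) and cluster sizes.
    sums, sizes = {}, {}
    for row, lab in zip(X, labels):
        s = sums.setdefault(lab, [0.0] * p)
        for a in range(p):
            s[a] += row[a]
        sizes[lab] = sizes.get(lab, 0) + 1
    # Subtract the correction term sum_j s_j s_j' / (1 + n_j).
    for lab, s in sums.items():
        for a in range(p):
            for b in range(p):
                gram[a][b] -= s[a] * s[b] / (1.0 + sizes[lab])
    return gram

# Illustrative 4 x 2 design matrix.
X = [[1.0, 2.0], [0.0, 1.0], [2.0, -1.0], [1.0, 1.0]]
one_cluster = scaled_gram(X, [0, 0, 0, 0])   # A = 1_n
singletons = scaled_gram(X, [0, 1, 2, 3])    # A = I_n, equals X'X / 2
```

Because a non-negative definite term is subtracted from $X'X$, the scaled matrix is never larger than $X'X$, so its inverse (and hence the prior covariance in (4)) is never smaller than under the traditional g-prior.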
2.2 Bayes Factor in Semiparametric Linear Models

Throughout the rest of the paper, we will assume that the data $Y^n = (Y_1, \ldots, Y_n)'$ are
generated from the true model $M_1: Y^n = X_{\gamma_1}\beta_{\gamma_1} + \epsilon$, with $\epsilon_i$ i.i.d. from the true residual density
$f_0$, which is a density on $\Re$ with respect to Lebesgue measure. For modeling purposes, we
put a DP location mixture of Gaussians prior on the unknown $f_0$. For pairwise comparison,
we evaluate the evidence in favor of $M_1$ compared to $M_2$ using the Bayes factor, where

$$M_1: Y^n = X_{\gamma_1}\beta_{\gamma_1} + \epsilon_1, \quad \epsilon_{1i} \sim f,$$
$$M_2: Y^n = X_{\gamma_2}\beta_{\gamma_2} + \epsilon_2, \quad \epsilon_{2i} \sim f,$$
$$f(\cdot) = \int N(\cdot\,; \alpha, \tau^{-1})\,dP(\alpha), \quad P \sim DP(mP_0), \quad P_0 = N(0, \tau^{-1}),$$
$$\beta_{\gamma_j} \sim \pi(\beta_{\gamma_j}),\ j = 1, 2, \quad \pi(\tau^{-1}) \propto 1/\tau^{-1}, \quad g \sim \pi(g), \qquad (5)$$

where $\gamma_j$ indexes models of dimension $p_j$ in the model space $\mathcal{M}$ ($j = 1, 2$) and $\pi(\beta_\gamma)$ is defined
in (4). Our prior specification philosophy is similar to the one adopted by Guo and Speckman
(2009) for normal linear models, in that we assign proper priors on all elements of both $\beta_{\gamma_1}$ and $\beta_{\gamma_2}$
conditional on $(g, \tau^{-1})$, and an improper prior on $\tau^{-1}$ (for a more objective assessment).
However, unlike Guo and Speckman (2009), our focus is on Bayesian variable selection in
semi-parametric linear models.

Note that the likelihood of the response after marginalizing out $\eta$ in (3) turns out to
be $L(Y^n \mid A, \beta_\gamma, \tau^{-1}) = N(X_\gamma\beta_\gamma, \tau^{-1}\Sigma_A)$ (Kyung et al., 2009). Thus, conditional on $A$,
$Z_A = \Sigma_A^{-1/2}Y^n \sim N(\Sigma_A^{-1/2}X_\gamma\beta_\gamma, \tau^{-1}I_n)$, and we are in the normal linear model set-up:

$$Z_A = \tilde{X}_{A,\gamma}\beta_\gamma + \epsilon, \quad \epsilon \sim N(0, \tau^{-1}I_n), \quad \pi(\beta_\gamma) = N(0, g\tau^{-1}(\tilde{X}_{A,\gamma}'\tilde{X}_{A,\gamma})^{-1}), \qquad (6)$$

where $\tilde{X}_{A,\gamma} = \Sigma_A^{-1/2}X_\gamma$. Under a mixture of semi-parametric g-priors, we can directly use
expression (17) in Guo and Speckman (2009) to obtain, for $j = 1, 2$,

$$L(Z_A \mid M_j) = L(Y^n \mid A, M_j) \propto (Z_A'Z_A)^{-n/2}\int_0^\infty (1+g)^{-p_j/2}\left[1 - \frac{g}{1+g}\,\frac{Z_A'H_{A,j}Z_A}{Z_A'Z_A}\right]^{-n/2}\pi(dg), \qquad (7)$$

where $H_{A,j} = \tilde{X}_{A,\gamma_j}(\tilde{X}_{A,\gamma_j}'\tilde{X}_{A,\gamma_j})^{-1}\tilde{X}_{A,\gamma_j}'$ is the equivalent of a hat matrix in standard linear
regression. Also, marginalizing over all possible subcluster allocations for a given sample
size $n$, the following form for the marginal likelihood can be obtained (Kyung et al., 2009):

$$L(Y^n \mid M_j) = \frac{\Gamma(m)}{\Gamma(m+n)}\sum_{k=1}^{n} m^k \sum_{A \in \mathcal{A}_k}\prod_{l=1}^{k}\Gamma(n_l)\,L(Y^n \mid A, M_j) = \sum_{A_i \in C_n} w_i\,L(Z_{A_i} \mid M_j), \qquad (8)$$

where $\mathcal{A}_k$ is the collection of all possible $n \times k$ matrices corresponding to different allocations
of $n$ subjects into $k$ subclusters, and $C_n$ is the collection of all possible allocation matrices for a
sample size $n$, with $\sum_{A_i \in C_n} w_i = 1$. In the limiting case as $n \to \infty$, we have $C_\infty$ as the class
of 'limiting allocation matrices'. Using (7), the Bayes factor in favor of $M_2$ conditional on
the allocation matrix $A$ is given by



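$$BF_{21,A} = \frac{L(Z_A \mid M_2)}{L(Z_A \mid M_1)} = \frac{\int_0^\infty (1+g)^{-p_2/2}\left[1 - \frac{g}{1+g}R^2_{A,2}\right]^{-n/2}\pi(dg)}{\int_0^\infty (1+g)^{-p_1/2}\left[1 - \frac{g}{1+g}R^2_{A,1}\right]^{-n/2}\pi(dg)}, \qquad (9)$$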

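where $R^2_{A,j} = Z_A'H_{A,j}Z_A/Z_A'Z_A$ ($j = 1, 2$). Finally, using (8), the unconditional Bayes factor
in favor of $M_2$, marginalizing out $A$, is

$$BF_{21} = \frac{L(Y^n \mid M_2)}{L(Y^n \mid M_1)} = \frac{\sum_{A_i \in C_n} w_i\,L(Z_{A_i} \mid M_2)}{\sum_{A_i \in C_n} w_i\,L(Z_{A_i} \mid M_1)}. \qquad (10)$$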
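The conditional Bayes factor in (9) involves only one-dimensional integrals over $g$, so it can be evaluated numerically once the two coefficients of determination are available. The sketch below is one way to do this under the hyper-g prior, using the substitution $u = g/(1+g)$, for which $u \sim Be(1, a/2 - 1)$. The midpoint rule, the grid size and the function names are illustrative assumptions rather than the computational scheme used in this article.

```python
import math

def log_integral_over_g(n, p, r2, a=4.0, grid=4000):
    """log of int (1+g)^(-p/2) [1 - g/(1+g) r2]^(-n/2) pi(dg), as in (9).

    Assumes the hyper-g prior pi(g) = (a-2)/2 * (1+g)^(-a/2) with a > 2 and
    substitutes u = g/(1+g), so that u ~ Beta(1, a/2 - 1) on (0, 1).
    """
    log_terms = []
    for i in range(grid):
        u = (i + 0.5) / grid  # midpoint rule avoids the endpoints 0 and 1
        log_prior = math.log(a / 2.0 - 1.0) + (a / 2.0 - 2.0) * math.log1p(-u)
        # (1+g)^(-p/2) = (1-u)^(p/2) and g/(1+g) = u.
        log_f = 0.5 * p * math.log1p(-u) - 0.5 * n * math.log1p(-u * r2)
        log_terms.append(log_prior + log_f)
    top = max(log_terms)  # log-sum-exp keeps large n from overflowing
    return top + math.log(sum(math.exp(t - top) for t in log_terms) / grid)

def conditional_bayes_factor_21(n, p1, p2, r2_1, r2_2, a=4.0):
    """BF_{21,A}: evidence for M_2 over M_1 given one allocation matrix A."""
    return math.exp(log_integral_over_g(n, p2, r2_2, a)
                    - log_integral_over_g(n, p1, r2_1, a))
```

For example, `conditional_bayes_factor_21(100, 2, 3, 0.5, 0.5)` is below one, since the larger model gains nothing in fit and is penalized through the factor $(1+g)^{-p_2/2}$, whereas raising `r2_2` to 0.8 pushes the ratio far above one.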



3. ASYMPTOTIC PROPERTIES 

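In this section we focus on asymptotic properties including Bayes factor and variable selection
consistency. Before proceeding, note that the standard assumptions made for establishing
Bayes factor consistency in linear models (1) are:

(A1'): $\lim_{n\to\infty} \beta_{\gamma_1}'X_{\gamma_1}'X_{\gamma_1}\beta_{\gamma_1}/n = b_1 > 0$ under $M_1$;
(A2'): If $M_1 \not\subset M_2$, $\lim_{n\to\infty} \beta_{\gamma_1}'X_{\gamma_1}'(I_n - H_2)X_{\gamma_1}\beta_{\gamma_1}/n = b_2$, $0 < b_2 < b_1$, under $M_1$,

where $H_2 = X_{\gamma_2}(X_{\gamma_2}'X_{\gamma_2})^{-1}X_{\gamma_2}'$ is the hat matrix in $M_2$. A necessary condition for assumption
(A1') to hold is that $X_{\gamma_1}$ has full rank, which is likely to be satisfied for fixed model dimensions
but cannot be guaranteed for increasing model dimensions without further assumptions.
Conditional on the limiting allocation matrix $A \in C_\infty$, we make similar assumptions. For
fixed $p_j$ and conditional on $A \in C_\infty$, we assume

(A1): $\lim_{n\to\infty} \beta_{\gamma_1}'\tilde{X}_{A,\gamma_1}'\tilde{X}_{A,\gamma_1}\beta_{\gamma_1}/n = b_{A,1} > 0$ under $M_1$;
(A2): If $M_1 \not\subset M_2$, $\lim_{n\to\infty} \beta_{\gamma_1}'\tilde{X}_{A,\gamma_1}'(I_n - H_{A,2})\tilde{X}_{A,\gamma_1}\beta_{\gamma_1}/n = b_{A,2} \in (0, b_{A,1})$ under $M_1$.

For $p_j = O(n^{a_j})$ ($j = 1, 2$) with $0 \le a_1 < a_2 < 1$, conditional on $A \in C_\infty$ we assume (A1) and

(A2*): If $M_1 \not\subset M_2$, $\lim_{n\to\infty} \beta_{\gamma_1}'\tilde{X}_{A,\gamma_1}'(I_n - H_{A,2})\tilde{X}_{A,\gamma_1}\beta_{\gamma_1}/n = b_{A,2} \in (0, b_{A,1})$ under $M_1$.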

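Note that (A1) $\Rightarrow$ (A1') (as $X_\gamma'\Sigma_A^{-1}X_\gamma \le X_\gamma'X_\gamma$), so that assumption (A1) is a stronger version
of (A1'). Further, for the two extreme cases when $A = 1_n$ and $A = I_n$, (A1') $\Rightarrow$ (A1). To see
this, note that $X_{\gamma_1}'\Sigma_{A=1_n}^{-1}X_{\gamma_1} \approx X_{\gamma_1}'X_{\gamma_1} - n\bar{X}_{\gamma_1}'\bar{X}_{\gamma_1}$, where $\bar{X}_{\gamma_1}$ is a $1 \times p$ vector containing
the column means of $X_{\gamma_1}$. This implies $\beta_{\gamma_1}'X_{\gamma_1}'\Sigma_{A=1_n}^{-1}X_{\gamma_1}\beta_{\gamma_1}/n \approx \beta_{\gamma_1}'X_{\gamma_1}^{c\prime}X_{\gamma_1}^{c}\beta_{\gamma_1}/n > 0$ (plugging
in $X_{\gamma_1}^c$ for $X_{\gamma_1}$ in (A1')), where $X_{\gamma_1}^c$ is the centered version of $X_{\gamma_1}$ such that $1_n'X_{\gamma_1}^c = 0_{1\times p}$.
On the other hand, for $A = I_n$, we have $\beta_{\gamma_1}'X_{\gamma_1}'\Sigma_{A=I_n}^{-1}X_{\gamma_1}\beta_{\gamma_1}/n = \frac{1}{2}\beta_{\gamma_1}'X_{\gamma_1}'X_{\gamma_1}\beta_{\gamma_1}/n > 0$. Assumptions (A2), (A2*)
can be interpreted as a positive 'limiting distance' between two models corresponding to
design matrices $X_{\gamma_1}$ and $X_{\gamma_2}$ in (2) conditional on $A \in C_\infty$, after marginalizing out $\eta$, i.e.
$\Delta_{21,A} = \lim_{n\to\infty} \beta_{\gamma_1}'\tilde{X}_{A,\gamma_1}'(I_n - H_{A,2})\tilde{X}_{A,\gamma_1}\beta_{\gamma_1}/n = b_{A,2} \in (0, \infty)$. Such a 'limiting distance'
($\Delta_{21,A}$) can be considered as a natural extension of the definition of distance between two
normal linear models in Casella et al. (2009) and Moreno et al. (2010).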
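The two extreme allocations can be worked out explicitly with the Sherman-Morrison identity. The derivation below is a sketch under the assumption $\Sigma_A = I_n + AA'$, writing $X$ for $X_{\gamma_1}$ and $\bar{X}$ for its $1 \times p$ row of column means (so that $1_n'X = n\bar{X}$).

```latex
\begin{align*}
A = 1_n:\quad \Sigma_A^{-1} &= (I_n + 1_n 1_n')^{-1} = I_n - \frac{1_n 1_n'}{1+n}, \\
X'\Sigma_A^{-1}X &= X'X - \frac{(1_n'X)'(1_n'X)}{1+n}
  = X'X - \frac{n^2}{n+1}\,\bar{X}'\bar{X} \;\approx\; X'X - n\,\bar{X}'\bar{X}, \\[4pt]
A = I_n:\quad \Sigma_A^{-1} &= (2 I_n)^{-1} = \tfrac{1}{2} I_n, \qquad
X'\Sigma_A^{-1}X = \tfrac{1}{2}\,X'X .
\end{align*}
```

In the first case the scaled Gram matrix is, up to a factor $n/(n+1)$ on the correction term, the Gram matrix of the column-centered design; in the second it is half of the usual Gram matrix.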
The following lemma gives the limits of quantities such as $R^2_{A,j} = Z_A'H_{A,j}Z_A/Z_A'Z_A$
($A \in C_\infty$, $j = 1, 2$), which will be useful for establishing asymptotic properties. The proof
follows directly from the fact that, conditional on the allocation matrix $A$, $Z_A = \Sigma_A^{-1/2}Y^n \sim
N(\tilde{X}_{A,\gamma_1}\beta_{\gamma_1}, \tau^{-1}I_n)$, and using Lemmas 1 and 2 of Guo and Speckman (2009). From here on,
we shall make all probability statements under the model $M_1$ as defined in (5).

Lemma 1. Suppose assumptions (A1), (A2) and (A2*) hold.

(i) If $M_1 \subset M_2$, then conditional on $A \in C_\infty$, $R^2_{A,1} \to b_{A,1}/(\tau^{-1} + b_{A,1})$ and $R^2_{A,2} \to b_{A,1}/(\tau^{-1} + b_{A,1})$ almost surely.

(ii) If $M_1 \not\subset M_2$, then conditional on $A \in C_\infty$, $R^2_{A,1} \to b_{A,1}/(\tau^{-1} + b_{A,1})$ and $R^2_{A,2} \to (b_{A,1} - b_{A,2})/(\tau^{-1} + b_{A,1})$ almost surely.

Although the next result establishes Bayes factor consistency in semi-parametric linear
models (2) under the class of proper priors for $g$, the result can be extended to improper
priors $\pi(g)$. For fixed $p$, $p_1$ can be greater or less than $p_2$, while for increasing $p$
we compare models with $p_1 = O(n^{a_1})$, $p_2 = O(n^{a_2})$ and $0 \le a_1 < a_2 < 1$, which involves
the special case of fixed $p_1$ but increasing $p_2$. As elaborated in Guo and Speckman (2009),
the class of priors $\pi(g)$ considered here includes hyper-g ($\pi(g) = \frac{a-2}{2}(1+g)^{-a/2}$) and hyper-g/n
($\pi(g) = \frac{a-2}{2n}(1+g/n)^{-a/2}$) priors, with $2 < a \le 4$ (Liang et al., 2008), Zellner-Siow and beta-prime
priors. Let the notation $a_n \approx b_n$ imply that $\lim_{n\to\infty} a_n/b_n > 0$ almost surely. We assume
the following conditions on $\pi(g)$:

(A3): There exists a constant $k \ge 0$ such that $\int_{a_n}^{c_0 a_n} \pi(dg) \approx n^{-k}$ for any constant $c_0 > 1$ and
any sequence $a_n \approx n$.

(A4): There exists a constant $k_u$ such that $k - (p_2 - p_1)/2 < k_u < k$ and $\int (1+g)^{k_u}\pi(dg) \approx 1$.
Assumption (A3) ensures that the prior mass for the tail decreases exponentially fast, which
is a weak condition and quite reasonable.
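As a check on the first condition, read here as requiring $\int_{a_n}^{c_0 a_n}\pi(dg) \approx n^{-k}$, the hyper-g and hyper-g/n priors of Liang et al. (2008) can be integrated in closed form. The calculation below is a sketch that takes $a_n = n$ for simplicity; the resulting values of $k$ are a consequence of this reading rather than constants quoted from the source.

```latex
\begin{align*}
\text{hyper-}g:\quad
\int_{n}^{c_0 n} \frac{a-2}{2}\,(1+g)^{-a/2}\,dg
 &= (1+n)^{-(a/2-1)} - (1+c_0 n)^{-(a/2-1)} \\
 &\approx n^{-(a/2-1)}\left(1 - c_0^{-(a/2-1)}\right),
 \qquad\text{so } k = \tfrac{a}{2} - 1 \in (0, 1], \\[4pt]
\text{hyper-}g/n:\quad
\int_{n}^{c_0 n} \frac{a-2}{2n}\left(1+\frac{g}{n}\right)^{-a/2} dg
 &= \int_{1}^{c_0} \frac{a-2}{2}\,(1+t)^{-a/2}\,dt
 = 2^{-(a/2-1)} - (1+c_0)^{-(a/2-1)},
 \qquad\text{so } k = 0 .
\end{align*}
```

The second line uses the substitution $t = g/n$ and gives a positive constant that does not depend on $n$.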
Theorem I. Suppose assumptions (A1), (A2) and (A2*) hold.

(I) Suppose $p_1$ and $p_2$ are fixed. If $M_1 \subset M_2$, then under $M_1$ and assumptions (A3), (A4),
$BF_{21} \to 0$ in probability as $n \to \infty$, and if $p_2 - p_1 > 2 + 2(k - k_u)$, $BF_{21} \to 0$ almost surely as
$n \to \infty$. Further, if $M_1 \not\subset M_2$, then under $M_1$ and assumption (A3), $BF_{21} \to 0$ almost surely as
$n \to \infty$.

(II) Suppose $p_1 = O(n^{a_1})$ and $p_2 = O(n^{a_2})$, with $0 \le a_1 < a_2 < 1$. Then under $M_1$ and
assumption (A3), $BF_{21} \to 0$ almost surely as $n \to \infty$.

REMARK 1. The above result can be easily extended to improper priors $\pi(g)$.

In settings in which there are not two models under consideration but many, it is often
of interest to see if the posterior model probability $P(M_1 \mid Y^n)$ goes to 1 as $n \to \infty$. The next
theorem gives such a result, making use of a sequence of prior model probabilities depending
on $n$ and assuming that the growth rate of $M_1$ is known, for increasing model dimensions.

Theorem II. Suppose the conditions of Theorem I hold. For fixed $p$ and under $M_1$,
$P(M_1 \mid Y^n) \to 1$ for any $\{\pi(M_\gamma) : \gamma \in \Gamma, \pi(M_1) > 0\}$. For increasing $p$ ($> p_1$) and under
Mi, P(Mi\Y n ) ->■ l/or{7r n ( 7i; ) oc 2~ p ^ 2 I[ Pj < 0(n ai )} +Nj 1 I[0{n ai ) < Pj < (n-l)Ap]}, 
where 7^ denotes the Ith model having pj predictors, I — 1, . . . , Nj, with Nj = ) . 

REMARK 2. The mode of convergence of $P(M_1 \mid Y^n)$ under $M_1$ is the same as that of
the associated conditional Bayes factors.

4. POSTERIOR COMPUTATION 

We propose an MCMC algorithm for posterior computation, which combines a stochastic
search variable selection algorithm (George and McCulloch, 1997) with recently proposed
methods for efficient computation in Dirichlet process mixture models. In particular, we
utilize the slice sampler of Walker (2007) incorporating the modification of Yau et al. (2011).
Using Sethuraman's (1994) stick-breaking representation, let

$$P = \sum_{j=1}^{\infty} w_j\delta_{\alpha_j}, \quad \alpha_j \sim N(0, \tau^{-1}), \quad w_j = v_j\prod_{l<j}(1 - v_l), \quad v_l \sim \text{Beta}(1, m).$$

The slice sampler of Walker (2007) relies on augmentation with uniform latent variables as
follows:

$$f_{w,\alpha}(y, u) = \sum_{j \in B_w(u)} N(y; \alpha_j, \tau^{-1}), \quad B_w(u) = \{j : w_j > u\}.$$
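The stick-breaking weights and the finite slice set $B_w(u)$ can be sketched in a few lines. The helper names below are illustrative, the stick proportions are fixed by hand rather than drawn from Beta(1, m), and components are indexed from 0 as is usual in Python.

```python
def stick_breaking_weights(v):
    """w_j = v_j * prod_{l<j} (1 - v_l) for a finite list of sticks v."""
    weights, remaining = [], 1.0
    for v_j in v:
        weights.append(v_j * remaining)
        remaining *= 1.0 - v_j  # length of stick left after breaking off w_j
    return weights

def slice_set(weights, u):
    """B_w(u) = {j : w_j > u}, using 0-based component indices."""
    return [j for j, w_j in enumerate(weights) if w_j > u]

w = stick_breaking_weights([0.5, 0.5, 0.5])   # [0.5, 0.25, 0.125]
active = slice_set(w, 0.2)                    # [0, 1]
```

The point of the augmentation is visible here: for any $u > 0$ only finitely many weights exceed $u$, so each subject needs to consider only a finite set of mixture components even though the representation is infinite.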

For sampling the DP precision parameter, we specify the prior $m \sim Ga(a_m, b_m)$, as such
a hierarchical specification is likely to ensure better performance by increasing the support
of the prior. Further, we assume equal prior inclusion probability for all predictors, i.e.
$\pi(\gamma_k = 1) = 1/2$, $k = 1, \ldots, p$. We outline the posterior computation steps briefly below:

Step 1.1: Update the $v$'s after marginalizing out the augmented uniform variables using
$\pi(v_h \mid -) = Be(1 + n_h, \sum_{j>h} n_j + m)$.

Step 1.2: Update the augmented uniform variables from their full conditional as described in
Walker (2007).

Step 2: Update the allocation of atoms to different subjects using $f(y_i \mid u_i, S_i = h) \propto
N(y_i \mid \alpha_h, x_{\gamma,i}, \beta_\gamma, \tau^{-1})\,I(h \in B_w(u_i))$, $h = 1, \ldots, M$.

Step 3: Update the precision parameter of the DP using $\pi(m \mid -) = Ga(a_m + M, b_m -
\sum_{l=1}^{M}\log(1 - v_l))$, where $M$ is the number of clusters in the particular iteration.
Step 4: Letting $p_\gamma$ be the dimension of the current model, update $\tau^{-1}$ using
$\pi(\tau^{-1} \mid -) = Ga\left(\frac{n + p_\gamma}{2}, \frac{1}{2}\left\{(Y^n - X_\gamma\beta_\gamma)'\Sigma_A^{-1}(Y^n - X_\gamma\beta_\gamma) + \frac{1}{g}\beta_\gamma'X_\gamma'\Sigma_A^{-1}X_\gamma\beta_\gamma\right\}\right)$.

Step 5: Using the hyper-g prior and the fact that $g/(1+g) \sim Be(1, a/2 - 1)$, we can subsequently
adopt the griddy Gibbs approach (Ritter and Tanner, 1992) to update $g$.

Step 6: For variable selection, we update the $\gamma_j$'s one at a time by computing their posterior
inclusion probabilities after marginalizing out $\beta$ and conditional on inclusion indicators for
the remaining predictors as well as $g$, $\tau^{-1}$ and $A$. Denoting $\gamma(j)$ as the vector of variable
inclusion indicators with $\gamma_j = 1$, and $p_{\gamma(j)}$ as the sum of its elements, we can sample $\gamma_j$ from
the Bernoulli conditional posterior distribution with probabilities

$$\pi(\gamma_j = 1 \mid -) \propto (1+g)^{-p_{\gamma(j)}/2}\exp\left\{-\frac{\tau}{2}\,Z_A'\left(I_n - \frac{g}{1+g}H_{A,\gamma(j)}\right)Z_A\right\}.$$

Step 7: Set $\{\beta_j : \gamma_j = 0\} = 0$ and update $\beta_\gamma = \{\beta_j : \gamma_j = 1\}$ using $\pi(\beta_\gamma \mid -) = N(\beta_\gamma; E, V)$,
where $V = \left\{\frac{\tau}{g}(X_\gamma'\Sigma_A^{-1}X_\gamma) + \tau X_\gamma'X_\gamma\right\}^{-1}$ and $E = V\tau X_\gamma'(Y^n - A\eta)$.
5. SIMULATION STUDY 

We present the results of two simulation studies to demonstrate the utility of Bayesian
variable selection in semi-parametric linear models. For the first case (Case I), the truth
was generated from a linear model involving ten predictors with coefficients (3, 2, -1, 0, 1.5, 1,
0, -4, -1.5, 0) and a bimodal residual specified by 0.5*N(2.5, 1) + 0.5*N(-2.5, 1). For the second
case (Case II), the truth was generated from a normal linear model with the same set of
regression coefficients and intercept = 1. The covariates were generated independently from
a Uniform(-1, 1) distribution. For each case, we generated 20 different replicates for each of the
sample sizes 100, 200, 300, 400 and 500, and summarized the results across the replicates.
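The Case I data-generating mechanism can be sketched as below. The coefficient vector is an assumed reading of the ten true coefficients, (3, 2, -1, 0, 1.5, 1, 0, -4, -1.5, 0), and the function name and seeding scheme are illustrative only.

```python
import random

# Assumed reading of the ten true coefficients (three of them zero).
BETA = [3.0, 2.0, -1.0, 0.0, 1.5, 1.0, 0.0, -4.0, -1.5, 0.0]

def simulate_case_one(n, seed=0):
    """Generate (X, y) with Uniform(-1, 1) covariates and bimodal residuals."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in BETA]
        # Residual from the mixture 0.5*N(2.5, 1) + 0.5*N(-2.5, 1).
        centre = 2.5 if rng.random() < 0.5 else -2.5
        eps = rng.gauss(centre, 1.0)
        X.append(x)
        y.append(sum(b * xj for b, xj in zip(BETA, x)) + eps)
    return X, y
```

Calling `simulate_case_one(100, seed=r)` for `r = 0, ..., 19` would give 20 replicates at sample size 100, mirroring the replicate structure described above.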

After generating the data in such a manner, we compared the performance of our method
using marginal inclusion probabilities for each predictor (given by $P(\beta_j \ne 0 \mid Y^n)$, $j = 1, \ldots, p$),
with the normal linear model having $\beta_\gamma \sim N(0, g\tau^{-1}(\tilde{X}_{A=1_n,\gamma}'\tilde{X}_{A=1_n,\gamma})^{-1})$. This prior on $\beta_\gamma$
is a special case of the SLM with $A = 1_n$, and is an attempt to assign comparable prior
information to both the methods. The replicate averaged marginal inclusion probabilities of
each predictor are reported across different sample sizes. We used a Ga(0.1, 1) prior on the DP
precision parameter. Further, we chose a Be(1, 1) prior for $g/(1+g)$, which corresponds to $a = 4$ in
the hyper-g prior. For the griddy Gibbs approach, we chose 1,000 equally spaced quantiles
from the Be(1, 1) distribution. We ran 50,000 iterations with a burn-in of 5,000.

As the sample size increases, it is interesting to see how the marginal probabilities of 
inclusion for important predictors and the marginal probabilities of exclusion for unimportant 
predictors change. The marginal inclusion probabilities under both the methods were 1.00
for $\beta_1 = 3$ and $\beta_8 = -4$ for all the sample sizes. For the remaining predictors, the plots of the
marginal inclusion probabilities over different sample sizes are presented (Fig 1 and Fig 2), 
as a comparison between the two methods. These plots depict a faster rate of increase of the 
marginal inclusion probabilities of the important predictors for the semi-parametric Bayes 
method when the true residuals are non-Gaussian and a similar rate of increase under both 
methods when the true residuals are Gaussian. In contrast, for the unimportant predictors 
the exclusion probabilities converge to one slowly for both the methods, reflecting the well 
known tendency for slower accumulation of evidence in favor of the true null. 

To get a closer look when the true residual is non-Gaussian (Case I), we present the results
for the sample size 100. As a comparison, we also present regression estimates under the lasso
(Tibshirani, 1996) and elastic net (Zou and Hastie, 2005), using the glmnet package in R
with default settings. The average mean square error for out of sample prediction for a test
sample size of 25 under the semi-parametric linear model was 7.7 compared to 15.4 under the
normal linear model, implying a 50% reduction. The average out of sample MSEs were 7.68
for lasso (LI) and 7.65 for elastic net (EL). Out of the 20 different replicates generated with
sample size 100, the normal linear model (NLM) chose the wrong subset of predictors 16
times under the median probability model, whereas the semi-parametric linear model (SLM)
made incorrect variable selection decisions for 3 out of 20 replicates. The computation time
per iteration for SLM was marginally longer than for NLM, with the difference increasing as
the number of clusters increases. The mixing for the fixed effects was good under both the
methods. The results for SLM do not appear to be sensitive to the hyper-parameters in
$\pi(m)$, but are mildly sensitive to hyper-parameters in $\pi(g)$ for $n = 100$.

Table 1 summarizes results for the model averaged regression estimates ($\hat{\beta}$), including
95% pointwise credible intervals (C.I.) and marginal inclusion probabilities (MIP). The SLM
and NLM correctly identify the important as well as unimportant predictors. In general,
the LI and EL results seem to be unstable, with the coefficients shrunk to 0 varying over
different replicates. As a result, the replicate averaged estimates for LI and EL in Table 1
lead to inaccurate estimates of $\beta_4$ and $\beta_7$. For the estimation of the fixed effects, the MSE
around the true coefficient values was 0.015 for SLM, 0.084 for NLM, 0.047 for elastic net and
0.047 for lasso. Thus, the SLM results in more accurate estimates with narrower credible
intervals. From the results, it is clear that when the true residual is non-Gaussian, the SLM
not only has a more desirable performance in variable selection and estimation of regression
coefficients, it also has a superior out of sample predictive performance as compared to NLM.

6. APPLICATION TO DIABETES DATA 

The prevalence of diabetes in the United States is expected to more than double to 48
million people by 2050 (Mokdad et al., 2001). Previous medical studies have suggested
that Diabetes Mellitus type II (DM II) or adult onset diabetes could be associated with
high levels of total cholesterol (Brunham et al., 2007) and obesity (often characterized by
BMI and waist to hip ratio) (Schmidt et al., 1992), as well as hypertension (indicated by
a high systolic or diastolic blood pressure or both), which is twice as prevalent in diabetics
compared to non-diabetic individuals (Epstein and Sowers, 1992). However, most of these
results rely on informal treatment of data and lack rigorous statistical analysis to support
their conclusions.

We develop a comprehensive variable selection strategy for indicators of DM II based on 
data obtained from Department of Biostatistics, Vanderbilt University website, involving a 
diabetes study for African- Americans. Our primary focus is to discover important indicators 
of DM II by modeling the continuous outcome, glycosylated hemoglobin (> 7mg/dL indicates 
a positive diagnosis of diabetes) based on predictors such as total cholesterol (TC), stabilized 
glucose (SG), high density lipoprotein (HDL), age, gender, body mass index (BMI) indicator 
(overweight and obese with normal as baseline), systolic blood pressure (SBP), diastolic 
blood pressure (DBP), waist to hip ratio (WHR) and postprandial time indicator (PPT) 
(0/1 depending on whether the blood was drawn within 2 hours of a meal). In addition 
to the factors already noted above (total cholesterol, obesity and hypertension) for DM II, 
we note that lower levels of HDL have been known to be associated with insulin resistance 
syndrome (often considered a precursor of DM II with a conversion rate around 30%), and 
further we also expect PPT to be a significant indicator as blood sugar levels are high up to 
2 hours after a meal. 

After removing the records containing missing values, the data consisted of 365 subjects, 
which were split into multiple training and test samples of sizes 330 and 35 respectively. The 
replicate-averaged fixed effects estimates (multiplied by 100) for the SLM, NLM, lasso (L1) 
and elastic net (EL) are presented in Table 2, along with the marginal inclusion probabilities 
(MIP) for the SLM and the NLM. We also evaluate the out-of-sample predictive performance 
for each training-test split using predictive MSE for SLM, NLM, L1 and EL in Table 3, and 
additionally provide coverage (COV) and width (CIW) of 95% pointwise credible intervals 
for SLM and NLM. The same values of hyper-parameters were used as in Section 5 for SLM 
and NLM. For each replicate, we randomized the initial starting points and ran 100,000 
iterations for SLM (burn-in = 20,000) and 50,000 iterations for NLM (burn-in = 5,000). 
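
The three predictive summaries reported in Table 3 can be computed from posterior predictive draws along the following lines. This is a minimal sketch, not the authors' code: the function name `predictive_summary`, the use of the posterior predictive mean as the point prediction, and the use of equal-tailed quantile intervals are all assumptions made for illustration.

```python
import numpy as np

def predictive_summary(y_test, pred_draws, level=0.95):
    """Summarise out-of-sample performance from posterior predictive draws.

    y_test     : held-out responses, shape (n_test,)
    pred_draws : posterior predictive draws, shape (n_draws, n_test)
    level      : nominal level of the pointwise intervals
    """
    y_test = np.asarray(y_test, dtype=float)
    pred_draws = np.asarray(pred_draws, dtype=float)
    alpha = 1.0 - level

    # Assumption: the point prediction is the posterior predictive mean.
    point = pred_draws.mean(axis=0)
    # Assumption: equal-tailed pointwise intervals from empirical quantiles.
    lower = np.quantile(pred_draws, alpha / 2.0, axis=0)
    upper = np.quantile(pred_draws, 1.0 - alpha / 2.0, axis=0)

    mse = float(np.mean((y_test - point) ** 2))
    covered = (y_test >= lower) & (y_test <= upper)
    coverage = float(100.0 * covered.mean())   # percentage of test points covered
    width = float(np.mean(upper - lower))      # average interval width
    return {"mse": mse, "coverage": coverage, "width": width}
```

With 35 test subjects per split, coverage can only move in steps of 100/35, roughly 2.86 percentage points, which is why Table 3 shows values such as 97.14 (34 of 35 covered) and 91.42 (32 of 35).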

It is interesting to note from Table 2 that the SLM tells quite a different story from the 
NLM in terms of variable selection. In particular, while both models successfully identify 
total cholesterol, stabilized glucose and postprandial time as important predictors, it is only 
the SLM which identifies systolic hypertension (MIP = 0.77) and waist to hip ratio (MIP 
= 0.98) as important positively associated indicators, whereas the NLM fails to identify these 
factors (MIP = 0.18 for SBP and 0.17 for WHR) and instead selects age (MIP = 0.77) as 
an important predictor. Further, the SLM points to a stronger negative association with 
HDL (MIP = 0.64) than the NLM (MIP = 0.52). For both methods, the marginal 
inclusion probabilities for BMI (overweight and obese) were low, which could potentially be 
attributed to adjusting for other factors such as waist to hip ratio. The lasso and elastic 
net include all predictors except DBP, and hence produce an overly complex model. 
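
To make the comparison concrete, the sketch below shows how marginal inclusion probabilities are obtained from 0/1 inclusion indicators sampled by SSVS, and then contrasts the SLM and NLM selections using the MIP values reported in Table 2. The helper names and the 0.5 cutoff are assumptions introduced for illustration; the text above does not state a formal cutoff.

```python
import numpy as np

def marginal_inclusion_probabilities(gamma_draws, burn_in=0):
    """Average the 0/1 inclusion indicators over the retained MCMC draws.

    gamma_draws : array of shape (n_iter, n_predictors) with entries in {0, 1}
    burn_in     : number of initial iterations to discard
    """
    kept = np.asarray(gamma_draws, dtype=float)[burn_in:]
    return kept.mean(axis=0)

def selected(names, mip, cutoff=0.5):
    """Predictors whose MIP exceeds the cutoff (0.5 is an assumed choice)."""
    return [name for name, p in zip(names, mip) if p > cutoff]

# MIP values as reported in Table 2.
names = ["TC", "SG", "HDL", "Age", "Gender", "BMI(overwt)",
         "BMI(obese)", "SBP", "DBP", "WHR", "PPT"]
mip_slm = [0.98, 1.00, 0.64, 0.35, 0.21, 0.15, 0.15, 0.77, 0.19, 0.98, 0.62]
mip_nlm = [0.99, 1.00, 0.52, 0.77, 0.16, 0.17, 0.15, 0.18, 0.14, 0.17, 0.77]

slm_set = selected(names, mip_slm)
nlm_set = selected(names, mip_nlm)
only_slm = sorted(set(slm_set) - set(nlm_set))   # chosen by SLM but not NLM
only_nlm = sorted(set(nlm_set) - set(slm_set))   # chosen by NLM but not SLM
```

Under this assumed cutoff the two models agree on TC, SG, HDL and PPT, and differ on SBP and WHR (SLM only) and on age (NLM only).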

Variable selection in this application is clearly influenced by the assumptions on the 
residual density, with the nonparametric residual density providing a more realistic 
characterization that should lead to a more accurate selection of the important predictors. 
Figure 3 shows an estimate of the residual density obtained from the SLM analysis, suggesting 
a unimodal right-skewed density with a heavy right tail. The SLM results suggest that a mixture 
of two Gaussians provides an adequate characterization of this density. The SLM is only 
marginally slower to compute than the NLM, and in addition the SLM exhibits good mixing 
for most of the fixed effects (Table 4). These results are robust to SSVS starting points, 
and consistency in the results across training-test splits also indirectly suggests adequate 
computational efficiency of SSVS. 
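
Mixing diagnostics of the kind summarised in Table 4 are sample autocorrelations of each coefficient chain at a few lags. A minimal sketch follows; the function name `autocorrelation` and the particular estimator (lagged cross-products divided by the lag-0 sum of squares) are assumptions, since the text does not say how the autocorrelations were computed. The default lags are those printed in Table 4.

```python
import numpy as np

def autocorrelation(chain, lags=(1, 5, 10, 21, 50)):
    """Sample autocorrelation of a post-burn-in MCMC chain at the given lags."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()                      # centre the chain
    denom = float(np.dot(x, x))           # lag-0 sum of squares
    if denom == 0.0:
        raise ValueError("chain is constant; autocorrelation is undefined")
    out = {}
    for lag in lags:
        if not 0 < lag < x.size:
            raise ValueError(f"lag {lag} is not valid for a chain of length {x.size}")
        # Cross-product of the chain with itself shifted by `lag`.
        out[lag] = float(np.dot(x[:-lag], x[lag:]) / denom)
    return out
```

Values close to zero at moderate lags indicate that successive draws carry nearly independent information, which is the sense in which Table 4 supports good mixing.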

In terms of out-of-sample predictive MSE (Table 3), none of the models is a clear winner, 
with the relative performance varying across training-test splits. The MSEs for the lasso and 
elastic net are very similar to those of the NLM, except for the second test sample, where they 
have the lowest MSE. Overall, the NLM has narrower 95% pointwise credible intervals than 
the SLM, often resulting in poorer coverage. Thus, in conclusion, although the competitors 
yield comparable out-of-sample predictive performance, it is only the SLM which succeeds in 
choosing the most reasonable model for DM II, consistent with previous medical evidence. 
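
The claims in this paragraph can be checked directly against Table 3. The short tally below is an illustrative sketch: the numbers are transcribed from Table 3, reading its fourth column as the NLM predictive MSE, and the variable names are hypothetical.

```python
# Predictive MSE, coverage (%) and interval width per test split, from Table 3.
mse = {
    "SLM": [1.27, 4.67, 1.55, 1.22, 1.42, 1.46, 3.70, 1.24],
    "NLM": [1.23, 4.67, 1.78, 1.26, 1.16, 1.43, 3.40, 1.50],
    "L1":  [1.36, 1.21, 1.75, 1.24, 1.17, 1.52, 3.38, 1.54],
    "EL":  [1.24, 1.20, 1.75, 1.23, 1.18, 1.52, 3.38, 1.53],
}
cov = {
    "SLM": [97.14, 94.28, 100.00, 97.14, 100.00, 100.00, 91.42, 100.00],
    "NLM": [97.14, 91.42, 94.28, 97.14, 100.00, 97.14, 91.42, 97.14],
}
ciw = {
    "SLM": [6.94, 6.23, 6.83, 6.77, 6.79, 6.79, 6.47, 6.79],
    "NLM": [5.91, 5.40, 5.82, 5.91, 5.92, 5.90, 5.59, 5.87],
}

n_splits = 8
wins = {method: 0 for method in mse}
for i in range(n_splits):
    best = min(values[i] for values in mse.values())
    for method, values in mse.items():
        if values[i] == best:
            wins[method] += 1          # ties credit every method at the minimum

mean_ciw = {m: sum(v) / len(v) for m, v in ciw.items()}
nlm_never_better_covered = all(n <= s for n, s in zip(cov["NLM"], cov["SLM"]))
```

The tally gives three splits each to the SLM and NLM, with the lasso and elastic net sharing the remaining two, while the average interval width is about 6.70 for the SLM against 5.79 for the NLM and the NLM coverage never exceeds that of the SLM.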

7. DISCUSSION 

We develop mixtures of semi-parametric g-priors for linear models with non-parametric 
residuals characterized by DP mixtures of Gaussians. The proposed method addresses the often 
encountered issue of non-Gaussianity of residuals in variable selection settings, and has 
attractive asymptotic justifications such as Bayes factor and variable selection consistency 
involving fixed p as well as p > n (under some restrictions on the model space). Further, 
the method is essentially no more difficult to implement than SSVS for normal linear models 
and can lead to substantially different conclusions, as illustrated in the diabetes application. 
The general topic of semi- and nonparametric Bayesian model selection is understudied and 
we hope that this work stimulates additional research of this type in broader model classes, 
such as for generalized linear models and nonparametric regression. 

APPENDIX: PROOF OF RESULTS 

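Proof of Theorem I: Using similar methods as in the proof of Theorem 2 in Guo and 
Speckman (2009), it can be shown that, conditional on A and under assumptions (A3) and (A4), 
the upper and lower bounds of 

$$I_1 = \int_0^\infty (1+g)^{-p_1/2} \left[ 1 - \frac{g}{1+g} R^2_{A,1} \right]^{-n/2} \pi(dg)$$

are 

$$I_1 \le \left( \frac{p_1 + 2k_u}{n - p_1 - 2k_u} \right)^{p_1/2 + k_u} \left( \frac{1 - R^2_{A,1}}{R^2_{A,1}} \right)^{p_1/2 + k_u} \left( \frac{n}{n - p_1 - 2k_u} \right)^{-n/2} \left( 1 - R^2_{A,1} \right)^{-n/2}$$

$$\le \left( \frac{p_1 + 2k_u}{n - p_1 - 2k_u} \right)^{p_1/2 + k_u} \left( \frac{1 - R^2_{A,1}}{R^2_{A,1}} \right)^{p_1/2 + k_u} \left( 1 - R^2_{A,1} \right)^{-n/2} = U_{A,1}(n)$$

and $I_1 \ge n^{-p_1/2 - k} \left( 1 - R^2_{A,1} \right)^{-n/2} = L_{A,1}(n)$. Similarly, 

$$L_{A,2}(n) \le I_2 = \int_0^\infty (1+g)^{-p_2/2} \left[ 1 - \frac{g}{1+g} R^2_{A,2} \right]^{-n/2} \pi(dg) \le U_{A,2}(n).$$

Therefore, 

$$BF^n_{21,A} \le \frac{U_{A,2}(n)}{L_{A,1}(n)} = n^{p_1/2 + k} \left( \frac{p_2 + 2k_u}{n - p_2 - 2k_u} \right)^{p_2/2 + k_u} \left( \frac{1 - R^2_{A,2}}{R^2_{A,2}} \right)^{p_2/2 + k_u} \left( \frac{1 - R^2_{A,2}}{1 - R^2_{A,1}} \right)^{-n/2}. \qquad (11)$$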
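
The shape of the upper bound on $I_1$ can be motivated by a short calculus exercise. The sketch below is illustrative and is not taken from Guo and Speckman (2009): it assumes, hypothetically, that the prior contributes at most a factor proportional to $(1+g)^{-k_u}$, so that the integrand is dominated by a multiple of $h(g)$ with $a = p_1/2 + k_u > 0$, and it assumes $n > 2a$ and $R^2_{A,1} \ge 2a/n$ so that the maximizer lies in the interior.

```latex
% Illustrative only: maximize h over g >= 0, writing u = 1/(1+g) in (0, 1].
\begin{align*}
h(g) &= (1+g)^{-a}\Bigl[1 - \tfrac{g}{1+g}\,R^2_{A,1}\Bigr]^{-n/2}
      = u^{a}\bigl[(1 - R^2_{A,1}) + u\,R^2_{A,1}\bigr]^{-n/2}, \\
\frac{d\log h}{du} &= \frac{a}{u}
      - \frac{n}{2}\cdot\frac{R^2_{A,1}}{(1 - R^2_{A,1}) + u\,R^2_{A,1}} = 0
  \;\Longrightarrow\;
  u^{*} = \frac{2a}{n - 2a}\cdot\frac{1 - R^2_{A,1}}{R^2_{A,1}}, \\
(1 - R^2_{A,1}) + u^{*} R^2_{A,1} &= (1 - R^2_{A,1})\,\frac{n}{n - 2a}, \\
\sup_{g \ge 0} h(g) &= \Bigl(\frac{2a}{n - 2a}\Bigr)^{a}
      \Bigl(\frac{1 - R^2_{A,1}}{R^2_{A,1}}\Bigr)^{a}
      \Bigl(\frac{n}{n - 2a}\Bigr)^{-n/2}
      \bigl(1 - R^2_{A,1}\bigr)^{-n/2}.
\end{align*}
```

Substituting $2a = p_1 + 2k_u$ gives an expression of the same form as the first bound on $I_1$, and since $\left( n/(n - 2a) \right)^{-n/2} \le 1$, dropping that factor can only enlarge the bound.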

Case (I): For fixed p, directly from the proof of Theorem 3 in Guo and Speckman (2009), 

$$BF^n_{21,A} \le \zeta(A,n) = n^{(p_1 - p_2)/2 + k - k_u} \left( \frac{1 - R^2_{A,2}}{1 - R^2_{A,1}} \right)^{-n/2} \to 0 \text{ under } M_1 \text{ for all } A \in \mathcal{C}_\infty. \qquad (12)$$

Further, $BF^n_{21,A} \le \zeta(A,n)$ implies $L(Y^n \mid A, M_2) \le \zeta(A,n) L(Y^n \mid A, M_1)$, so that 

$$L(Y^n \mid M_2) \le \sum_{A \in \mathcal{C}_n} \pi(A)\, \zeta(A,n)\, L(Y^n \mid A, M_1) \le \max_{A \in \mathcal{C}_n} \zeta(A,n)\, L(Y^n \mid M_1). \qquad (13)$$

In the limiting sense as $n \to \infty$, the maximum in the upper bound in (13) is computed over 
$A \in \mathcal{C}_\infty$. From (12), $\zeta(A,n) \to 0$ under $M_1$ for all $A \in \mathcal{C}_\infty$ implies 
$\max_{A \in \mathcal{C}_\infty} \zeta(A,n) \to 0$. Dividing both sides of (13) by $L(Y^n \mid M_1)$, this implies 
$BF^n_{21} \to 0$ under $M_1$. Further, the mode of convergence of $BF^n_{21}$ is the same as that of 
$BF^n_{21,A}$, and the rest follows from the proof of Theorem 3 in Guo and Speckman (2009). 

Case (II): For increasing model dimensions $p_1 = O(n^{a_1})$ and $p_2 = O(n^{a_2})$ with 
$0 < a_1 < a_2 < 1$, for $g \sim \pi(g)$ we will only assume (A3) so that $k_u = 0$. Using (11), we have 

$$BF^n_{21,A} \le n^{(p_1 - (1 - a_2) p_2)/2 + k} \left( \frac{1 - R^2_{A,2}}{R^2_{A,2}} \right)^{p_2/2} \left( \frac{1 - R^2_{A,2}}{1 - R^2_{A,1}} \right)^{-n/2}. \qquad (14)$$

Let us consider the following cases under $0 < a_1 < a_2 < 1$. 

Case C1: $M_1 \subset M_2$. We have $Q_j = \tau (Z_A' Z_A - Z_A' H_{A,j} Z_A) \sim \chi^2_{n - p_j}$, $j = 1, 2$, and 
$Q_1 - Q_2 = \tau Z_A' (H_{A,2} - H_{A,1}) Z_A \sim \chi^2_{p_2 - p_1}$. Using Lemma 1 of Guo et al. (2009), 

$$\frac{1 - R^2_{A,2}}{1 - R^2_{A,1}} = \frac{Z_A' Z_A - Z_A' H_{A,2} Z_A}{Z_A' Z_A - Z_A' H_{A,1} Z_A} = \frac{Q_2}{Q_1} = 1 - \frac{(Q_1 - Q_2)/(p_2 - p_1)}{Q_1/(n - p_1)} \cdot \frac{p_2 - p_1}{n - p_1} \xrightarrow{a.s.} 1.$$

Moreover, $\frac{1 - R^2_{A,2}}{R^2_{A,2}} \to \frac{\tau^{-1}}{b_{A,1}}$ under $M_1$, which implies that 
$\left( \frac{1 - R^2_{A,2}}{R^2_{A,2}} \right)^{p_2/2}$ blows up at a rate strictly slower than the rate at which 
$n^{p_1/2 - (1 - a_2) p_2/2 + k} \to 0$. This implies that $BF^n_{21,A} \to 0$ under $M_1$, for all $A \in \mathcal{C}_\infty$. 

Case C2: $M_1 \not\subset M_2$. Using Lemma 1, 

$$\frac{1 - R^2_{A,2}}{R^2_{A,2}} \xrightarrow{a.s.} \frac{\tau^{-1} + b_{A,1} - b_{A,2}}{b_{A,2}}, \qquad \frac{1 - R^2_{A,1}}{1 - R^2_{A,2}} \xrightarrow{a.s.} \frac{\tau^{-1}}{\tau^{-1} + b_{A,1} - b_{A,2}}, \qquad \text{under } M_1.$$

For fixed $\tau^{-1}$ and $b_{A,2} > 0$ (under (A2)), 
$\left( \frac{1 - R^2_{A,2}}{R^2_{A,2}} \right)^{p_2/2} \left( \frac{1 - R^2_{A,2}}{1 - R^2_{A,1}} \right)^{-n/2} \to 0$. In addition, we 
have $p_1 - (1 - a_2) p_2 + k < 0$ for $0 < a_1 < a_2 < 1$, which implies $BF^n_{21,A} \to 0$ under $M_1$. 
Subsequently, using similar arguments as in Case (I), $BF^n_{21} \to 0$ under $M_1$ for both C1 and C2. 

Proof of Theorem II: For fixed p, the result is trivial to prove using Theorem I. 
For increasing p ($> p_1$), 

$$\pi(\gamma_{jl}) \propto 2^{-p_j/2}\, I[p_j \le O(n^{a_1})] + N_j^{-1}\, I[O(n^{a_1}) < p_j \le (n-1) \wedge p], \qquad l = 1, \ldots, N_j, \quad N_j = \binom{p}{p_j}.$$

Let $BF^n_{jl,1}$ denote the Bayes factor between models $\gamma_{jl}$ and $M_1$, and let 
$H^n(a_1) := \{ j : O(n^{a_1}) < p_j \le (n-1) \wedge p \}$. Then 

$$P(M_1 \mid Y^n) = \left[ 1 + 2^{p_1/2} \sum_{j : p_j \in H^n(a_1)} \sum_{l=1}^{N_j} N_j^{-1} BF^n_{jl,1} \right]^{-1}.$$

From the preceding proof of Theorem I, the upper bound for 
$\{ BF^n_{jl,1} : O(n^{a_1}) < p_j \le (n-1) \wedge p \}$ for large n is $U^n_j$, given by 
(a) $U^n_j \approx K^{p_j/2} n^{-(1 - a_j) p_j/2 + p_1/2 + k}$ for $0 < K < \infty$, for the nested case; 
(b) $U^n_j \le n^{-(1 - a_j) p_j/2 + p_1/2 + k}$ for the non-nested case, with $k > 0$. Therefore 

$$P(M_1 \mid Y^n) \ge \left[ 1 + 2^{p_1/2} \sum_{j : p_j \in H^n(a_1)} \sum_{l=1}^{N_j} N_j^{-1} U^n_j \right]^{-1} = \left[ 1 + 2^{p_1/2} \sum_{j : p_j \in H^n(a_1)} U^n_j \right]^{-1}$$

$$\ge \left[ 1 + n\, 2^{p_1/2} \max_{j : p_j \in H^n(a_1)} U^n_j \right]^{-1} \to 1 \text{ as } n \to \infty,$$

using (a), (b), and Theorem I. Further, the mode of convergence under $M_1$ is the same as that 
of the associated conditional Bayes factors. 
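
The final step of the argument rests on the identity that expresses the posterior probability of a reference model through the Bayes factors of its competitors against it, weighted by prior odds. A minimal numerical sketch follows; the function name `posterior_prob_reference` and its log-scale interface are assumptions made for illustration and are not part of the paper.

```python
import math

def posterior_prob_reference(log_bf, log_prior_odds):
    """P(M_1 | Y) = 1 / (1 + sum_j prior_odds_j * BF_j1), computed on the log scale.

    log_bf         : log Bayes factors of each competing model against M_1
    log_prior_odds : log of prior(model j) / prior(M_1), in the same order
    """
    if len(log_bf) != len(log_prior_odds):
        raise ValueError("log_bf and log_prior_odds must have the same length")
    terms = [b + o for b, o in zip(log_bf, log_prior_odds)]
    # Shift by the largest log-term (0.0 is the log-weight of M_1 itself)
    # so that the exponentials cannot overflow.
    shift = max([0.0] + terms)
    reference = math.exp(-shift)
    denom = reference + sum(math.exp(t - shift) for t in terms)
    return reference / denom
```

When every competing Bayes factor tends to zero fast enough to offset the number of competitors and their prior odds, the sum in the denominator vanishes and the probability tends to one, which mirrors the last display of the proof.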
Table 1: Results for Case I when n=100. SLM: Semi-parametric linear model, NLM: 
Normal linear model, L1: Lasso, EL: Elastic Net. MIP: Marginal Inclusion Probability. 

β_True   MIP_SLM   MIP_NLM   β_SLM                   β_NLM                   β_L1    β_EL
------   -------   -------   ---------------------   ---------------------   -----   -----
 3       0.99      0.99       2.89 (2.22, 3.44)       2.74 (1.73, 3.57)       2.93    2.92
 2       0.98      0.94       1.84 (1.37, 2.55)       1.65 (1.21, 3.28)       1.61    1.61
-1       0.89      0.59      -0.84 (-1.47, -0.32)    -0.59 (-1.95, -0.03)    -1.18   -1.17
 0       0.19      0.25      -0.02 (-0.41, 0.33)     -0.01 (-0.52, 0.45)      0.21    0.21
 1.5     0.97      0.85       1.39 (0.89, 2.08)       1.24 (1.25, 3.12)       1.9     1.89
 1       0.91      0.58       0.83 (0.27, 1.42)       0.61 (-0.12, 1.11)      0.92    0.92
 0       0.24      0.28      -0.05 (-0.45, 0.28)     -0.06 (-0.72, 0.29)     -0.22   -0.23
-4       1.00      1.00      -3.86 (-4.39, -3.16)    -3.64 (-4.42, -2.52)    -4.05   -4.04
-1.5     0.95      0.77      -1.34 (-2.03, -0.81)    -1.14 (-2.11, -0.01)    -1.33   -1.34
 0       0.21      0.30      -0.02 (-0.41, 0.31)     -0.04 (-0.59, 0.41)      0.07    0.07

Table 2: Fixed effects (times 100) and marginal inclusion probabilities (MIP). 
SLM: Semi-parametric linear model, NLM: Normal linear model, L1: Lasso, EL: Elastic Net. 

Predictor     β_SLM                     β_NLM                     β_L1      β_EL      MIP_SLM   MIP_NLM
-----------   -----------------------   -----------------------   -------   -------   -------   -------
TC              0.48 (0.12, 0.84)         0.63 (0.17, 1.07)         0.75      0.75    0.98      0.99
SG              2.16 (1.81, 2.52)         2.84 (2.52, 3.17)         2.83      2.82    1.00      1.00
HDL            -0.48 (-1.33, 0.02)       -0.52 (-1.66, 0.03)       -1.02     -1.02    0.64      0.52
Age             0.51 (0, 1.64)            1.30 (0.01, 2.56)         1.19      1.19    0.35      0.77
Gender         -3.05 (-28.83, 6.98)      -1.90 (-26.34, 7.14)     -19.66    -19.81    0.21      0.16
BMI(overwt)     0.82 (-7.54, 18.52)       2.04 (-5.30, 29.59)       4.33      4.27    0.15      0.17
BMI(obese)           (-13.12, 13.00)     -1.37 (-24.83, 9.13)     -14.88    -15.03    0.15      0.15
SBP             0.45 (-0.02, 1.24)        0.04 (-0.19, 0.71)        0.25      0.25    0.77      0.18
DBP            -0.03 (-0.94, 0.61)             (-0.58, 0.56)        0.018     0.017   0.19      0.14
WHR           211.74 (40.02, 361.41)      4.72 (-53.12, 102.12)    90.47     91.53    0.98      0.17
PPT            20.62 (0, 56.13)          32.13 (0, 75.89)          47.31     47.32    0.62      0.77



Table 3: Out of Sample Prediction. (Cov: 95% C.I. coverage, CIW: 95% C.I. width) 

Sample     MSE_SLM   Cov(SLM)   CIW(SLM)   MSE_NLM   Cov(NLM)   CIW(NLM)   MSE_L1   MSE_EL
--------   -------   --------   --------   -------   --------   --------   ------   ------
Sample 1   1.27       97.14     6.94       1.23       97.14     5.91       1.36     1.24
Sample 2   4.67       94.28     6.23       4.67       91.42     5.40       1.21     1.20
Sample 3   1.55      100.00     6.83       1.78       94.28     5.82       1.75     1.75
Sample 4   1.22       97.14     6.77       1.26       97.14     5.91       1.24     1.23
Sample 5   1.42      100.00     6.79       1.16      100.00     5.92       1.17     1.18
Sample 6   1.46      100.00     6.79       1.43       97.14     5.90       1.52     1.52
Sample 7   3.70       91.42     6.47       3.40       91.42     5.59       3.38     3.38
Sample 8   1.24      100.00     6.79       1.50       97.14     5.87       1.54     1.53


Table 4: Auto-correlations across lags for fixed effects. 

              Lag 1              Lag 5              Lag 10             Lag 21             Lag 50
Predictor     SLM      NLM       SLM      NLM       SLM      NLM       SLM      NLM       SLM      NLM
-----------   ------   -------   ------   -------   ------   -------   ------   -------   ------   -------
TC            0.22      0.18     0.113     0.194    0.073     0.159     0.032    0.111    0.013     0.059
SG            0.59      0.06     0.386     0.038    0.285     0.022     0.14     0.009    0.06      0.016
HDL           0.19      0.02     0.081     0.012    0.041     0.013     0.01     0.021    0.0005   -0.006
Age           0.21      0.04     0.072     0.009    0.053    -0.0001    0.025    0.006    0.007    -0.014
Gender        0.06     -0.007    0.030     0.0003   0.013    -0.006     0.009   -0.014    0.005     0.019
BMI(overwt)   0.02     -0.002    0.01     -0.006    0.006     0.013    -0.006    0.009    0.0014    0.018
BMI(obese)    0.02      0.002    0.017     0.004    0.004     0.018     0.007   -0.003    0.000     0.000
SBP           0.29      0.0711   0.137     0.019    0.096     0.007     0.047    0.03     0.014     0.022
DBP           0.07      0.0239   0.021     0.019    0.019     0.031     0.009   -0.003    0.004    -0.012
WHR           0.44      0.0642   0.353     0.043    0.321     0.061     0.251    0.06     0.186    -0.003
PPT           0.22      0.0600   0.118     0.047    0.068     0.045     0.015    0.004   -0.002     0.019

Figure 1: Marginal Inclusion Probabilities (MIP): Truth generated from bimodal residual. Solid 
lines - Semi-parametric Linear Model, dashed lines - Normal Linear Model. 

Figure 2: Marginal Inclusion Probabilities (MIP): Truth generated from Gaussian residual. Solid 
lines - Semi-parametric Linear Model, dashed lines - Normal Linear Model. 

Figure 3: Residual plots for Diabetes study for Semi-parametric Linear Model 